Near-Zero Downtime with New Postgres Failover Manager

November 16, 2016

Contributed by Jason Davis

A main pillar of the EDB Postgres™ Platform from EnterpriseDB™ (EDB) is the Management Suite, which provides capabilities for disaster recovery, high availability, monitoring, management, and tuning of PostgreSQL and EDB Postgres Advanced Server. EDB significantly advanced the Management Suite recently with the release of EDB Postgres™ Failover Manager 2.1.

EDB Postgres Failover Manager provides highly available, fault-tolerant database clusters built on PostgreSQL streaming replication to reduce downtime and keep data available when a master database fails. EDB Failover Manager provides the cluster monitoring, failure detection, and failover procedures that can be integrated into a variety of high-availability ("nines") solutions.

EDB Failover Manager 2.1 advances the Management Suite with a controlled switchover capability, which improves the general manageability of streaming replication clusters, and with new configuration options that allow failover procedures to be customized to meet various requirements.

Controlled Switchover

One of the most frequently requested features from users of EDB Failover Manager has been a ‘controlled switchover’: a replica is promoted to master, and the old master is reconfigured to become a read replica of the new master. This is an important feature that supports two common use cases: (1) it enables the testing of failover procedures, and (2) it supports near-zero downtime maintenance.

In both use cases, the new -switchover option (efm promote <cluster-name> -switchover) should be used during a planned maintenance window. If the standby to be promoted is not up to date with the master, the switchover will not be allowed. During this operation, the following occurs:

  • The master database (DB1) is stopped and the master agent releases the virtual IP address.
  • The standby agent of the database to be promoted (DB2) runs an optional fencing script, and promotes the standby database to master. The agent then assigns the VIP address to the node, and runs the post-promotion script (if applicable).
  • A recovery.conf file is created on the old master (DB1) pointing to the new master node (DB2). DB1 is restarted as a standby that accepts replication from the new master.
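Expressed as commands, a controlled switchover might look like the following sketch. The cluster name acctg is a placeholder, and the exact invocations should be confirmed against the EFM 2.1 documentation.

```shell
# Verify cluster health and that the standby is caught up with the
# master before attempting a switchover ("acctg" is a hypothetical
# cluster name).
efm cluster-status acctg

# Promote the standby and reconfigure the old master as a standby of
# the new master. The switchover is refused if the standby lags.
efm promote acctg -switchover
```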

When testing failover procedures, this function ensures that your cluster continues to operate with the proper number of replicas, without requiring a full rebuild of the old master server.

To support near-zero downtime, this capability makes it possible for you to perform planned maintenance—for example, operating system or database patch applications—with minimal impact to your cluster. We recommend performing maintenance operations on your replicas first and restarting the replica databases as needed. When the cluster is in a state of no replication lag, you can use the switchover option when promoting one of the replicas so that you can subsequently update the old master with the same maintenance applications.
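The workflow above could be sketched roughly as follows; the host names, cluster name, package, and service names are all placeholders for illustration, not literal instructions.

```shell
# 1. Patch and restart each replica first, one at a time
#    (host, package, and service names are hypothetical).
ssh standby1 'sudo yum update -y && sudo systemctl restart edb-as'

# 2. Once replication lag reaches zero, swap roles: the standby is
#    promoted, and the old master rejoins the cluster as a standby.
efm promote acctg -switchover

# 3. Apply the same maintenance to the old master, now a standby.
ssh oldmaster 'sudo yum update -y && sudo systemctl restart edb-as'
```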

More Configuration Options and Custom Hooks

The more we see customers use EDB Failover Manager, the greater the need for the system to be flexible enough to support various deployment configurations. In this release, we added a number of improvements that allow customers to better integrate EDB Failover Manager notifications into their existing IT infrastructures and more precisely control failover procedures. These improvements include:

  • A new script.notification property, which triggers a customer-provided script that sends notifications to additional channels, such as SNMP or a real-time chat room, in addition to the standard email notifications.
  • Flexible cluster management with minimum.standbys or the promotable parameter.
    • The clusterwide minimum.standbys parameter specifies how many replicas must remain in the cluster in the event of a failover. If the number of standbys drops below this setting, a replica node will not be promoted in the event of a failure. This setting is used when there is a requirement that you always have one or more standbys with an active copy of the data.
    • The node-specific promotable parameter allows EDB Failover Manager to monitor the health of a read replica while ensuring that it is never considered for promotion. This is useful when a node has enough compute resources to operate as a reporting replica but is not sized to handle the primary workload.
  • The script.resumed parameter can be used to execute a script when an agent resumes monitoring. This is useful after maintenance operations, to ensure that certain steps are followed when an agent resumes monitoring a database.
  • Event information, such as the IP address of a new primary or failed master, is now passed to fencing and post-promotion scripts to enable scripts to make better decisions and trigger event-based activities.
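As a rough illustration, the properties discussed above might be set as follows in an agent's properties file. The script paths and values here are hypothetical, and property names and placement should be confirmed against the EFM 2.1 documentation.

```properties
# Cluster-wide settings (values are illustrative).

# Customer-provided script run for every notification, in addition to
# the standard emails (e.g., send an SNMP trap or post to a chat room).
script.notification=/usr/local/bin/efm_notify.sh

# Refuse to promote a standby if doing so would leave fewer than this
# many standbys in the cluster.
minimum.standbys=1

# Script run when an agent resumes monitoring, e.g., after maintenance.
script.resumed=/usr/local/bin/efm_resume_checks.sh

# Node-specific setting: monitor this replica's health, but never
# consider it for promotion (e.g., an undersized reporting replica).
promotable=false
```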

Next Steps

For more technical information, please read the EDB Failover Manager 2.1 Release Notes or Chapter 1.1 of the documentation. EDB Postgres Failover Manager 2.1 is available for download today.

Jason Davis is the Senior Director of Product Management at EnterpriseDB. 
